Naive Bayes (Gaussian, Multinomial, Complement, Bernoulli, Categorical) + Out-of-core#
Naive Bayes is a family of probabilistic, generative models.
It’s popular because it can be:
fast (training is mostly counting)
strong on sparse, high-dimensional data (e.g., text)
surprisingly good even when its assumptions are “wrong”
Learning goals#
By the end you should be able to:
explain Bayes’ rule and why Naive Bayes is a generative classifier
derive the Naive Bayes decision rule in log-space
understand the conditional independence assumption (and its consequences)
implement (from scratch) Gaussian NB, Multinomial NB, and Bernoulli NB
know when to use Complement NB and Categorical NB
train Naive Bayes out-of-core with partial_fit on streaming batches
Table of contents#
Bayes as “belief update”
The naive assumption (conditional independence)
Gaussian Naive Bayes (continuous features)
Multinomial Naive Bayes (count features)
Bernoulli Naive Bayes (binary features)
Complement Naive Bayes (imbalanced text)
Categorical Naive Bayes (discrete categories)
Out-of-core (streaming) fitting
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from dataclasses import dataclass
from scipy.special import logsumexp
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import (
GaussianNB,
MultinomialNB,
ComplementNB,
BernoulliNB,
CategoricalNB,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
1) Bayes as “belief update”#
Bayes’ rule is just a way to update beliefs when you see evidence:
\[
P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}
\]
\(P(y)\) is the prior (what you believed before seeing data)
\(P(x \mid y)\) is the likelihood (how compatible the data is with a hypothesis)
\(P(y \mid x)\) is the posterior (updated belief)
\(P(x)\) is a normalization constant
A helpful mental image#
Think of \(P(y)\) as a base rate and \(P(x \mid y)\) as an evidence multiplier.
Naive Bayes turns this into a classifier by comparing posteriors across classes.
# A classic Bayes example: medical test
# Disease prevalence (prior)
P_D = 0.01
# Test quality
sensitivity = 0.95 # P(test+ | disease)
false_positive_rate = 0.05 # P(test+ | no disease)
# Evidence P(test+) via total probability, then posterior P(disease | test+)
P_pos = sensitivity * P_D + false_positive_rate * (1 - P_D)
P_D_given_pos = sensitivity * P_D / P_pos
print(f"P(disease) = {P_D:.3f}")
print(f"P(test+ | disease) = {sensitivity:.3f}")
print(f"P(test+ | no disease) = {false_positive_rate:.3f}")
print(f"P(disease | test+) = {P_D_given_pos:.3f}")
fig = go.Figure()
fig.add_trace(go.Bar(x=["prior P(D)", "posterior P(D|+)"] , y=[P_D, P_D_given_pos]))
fig.update_layout(title="Bayes update: rare disease + positive test", yaxis_title="probability", width=650, height=420)
fig.show()
P(disease) = 0.010
P(test+ | disease) = 0.950
P(test+ | no disease) = 0.050
P(disease | test+) = 0.161
2) The naive assumption (conditional independence)#
Naive Bayes assumes:
given the class \(y\), the features \(x_1,\dots,x_d\) are conditionally independent.
Mathematically:
\[
P(x_1, \dots, x_d \mid y) = \prod_{j=1}^{d} P(x_j \mid y)
\]
This is “naive” because real-world features are often correlated.
Why it still works often#
The goal of classification is to pick the argmax class. Even if probabilities are imperfect, the ranking can be correct.
Many datasets have “mostly independent enough” signals (especially after preprocessing).
In text, the bag-of-words representation makes independence less crazy than it sounds.
Log-space is your friend#
Products become sums:
\[
\log\!\Big(P(y)\prod_{j=1}^{d} P(x_j \mid y)\Big) = \log P(y) + \sum_{j=1}^{d} \log P(x_j \mid y)
\]
This avoids numeric underflow and is computationally convenient.
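To make the underflow point concrete, here is a tiny sketch with made-up per-feature likelihoods (the values and length are arbitrary): a direct product of thousands of small probabilities underflows to zero in float64, while the sum of logs stays finite.
# Sketch: products of many small likelihoods underflow, sums of logs do not
p = np.full(2000, 0.1)                      # 2000 made-up per-feature likelihoods
print("direct product:", np.prod(p))        # underflows to 0.0 in float64
print("sum of logs:   ", np.log(p).sum())   # about -4605.2, perfectly representable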
# Visual demo: correlated features (independence is violated)
# Two classes with correlated 2D Gaussians
n = 700
mean0 = np.array([-1.0, -0.5])
mean1 = np.array([+1.0, +0.5])
cov = np.array([[1.0, 0.85], [0.85, 1.0]]) # strong correlation
X0 = rng.multivariate_normal(mean0, cov, size=n // 2)
X1 = rng.multivariate_normal(mean1, cov, size=n // 2)
X_corr = np.vstack([X0, X1])
y_corr = np.array([0] * (n // 2) + [1] * (n // 2))
corr0 = np.corrcoef(X0.T)[0, 1]
corr1 = np.corrcoef(X1.T)[0, 1]
print(f"Correlation (class 0): {corr0:.3f}")
print(f"Correlation (class 1): {corr1:.3f}")
fig = px.scatter(
x=X_corr[:, 0],
y=X_corr[:, 1],
color=y_corr.astype(str),
title="Correlated features (violates NB independence)",
labels={"x": "x1", "y": "x2", "color": "class"},
)
fig.update_traces(marker=dict(size=6, opacity=0.6))
fig.update_layout(width=720, height=470)
fig.show()
Correlation (class 0): 0.842
Correlation (class 1): 0.825
3) Gaussian Naive Bayes (continuous features)#
Gaussian NB assumes each feature is normally distributed within each class:
\[
P(x_j \mid y=c) = \frac{1}{\sqrt{2\pi\,\sigma^2_{c,j}}}\exp\!\left(-\frac{(x_j-\mu_{c,j})^2}{2\sigma^2_{c,j}}\right)
\]
Parameter estimation#
For each class \(c\) and feature \(j\), estimate
\[
\mu_{c,j} = \frac{1}{N_c}\sum_{i:\,y_i=c} x_{i,j},
\qquad
\sigma^2_{c,j} = \frac{1}{N_c}\sum_{i:\,y_i=c} \left(x_{i,j}-\mu_{c,j}\right)^2
\]
where \(\mu_{c,j}\) is the sample mean and \(\sigma^2_{c,j}\) the sample variance of feature \(j\) within class \(c\).
Decision rule (log posterior)#
For a sample \(x\):
\[
\hat{y} = \arg\max_{c}\left[\log \pi_c + \sum_{j=1}^{d} \log P(x_j \mid y=c)\right]
\]
where \(\pi_c = P(y=c)\) is the class prior.
Anecdote:
Gaussian NB is like saying: “In class A, feature 1 tends to be around 3 with some wiggle; in class B, it’s around 7…” and doing that for each feature independently.
@dataclass
class ScratchGaussianNB:
var_smoothing: float = 1e-9
def fit(self, X: np.ndarray, y: np.ndarray):
X = np.asarray(X, dtype=float)
y = np.asarray(y)
self.classes_, y_enc = np.unique(y, return_inverse=True)
n_classes = self.classes_.shape[0]
n_features = X.shape[1]
self.class_count_ = np.bincount(y_enc, minlength=n_classes).astype(float)
self.class_prior_ = self.class_count_ / self.class_count_.sum()
self.theta_ = np.zeros((n_classes, n_features), dtype=float) # means
self.var_ = np.zeros((n_classes, n_features), dtype=float) # variances
for c in range(n_classes):
Xc = X[y_enc == c]
self.theta_[c] = Xc.mean(axis=0)
self.var_[c] = Xc.var(axis=0)
# variance smoothing (like sklearn)
overall_var = X.var(axis=0).max() # scalar
self.epsilon_ = self.var_smoothing * overall_var
self.var_ = self.var_ + self.epsilon_
return self
def _joint_log_likelihood(self, X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
# shape: (n_samples, n_classes)
n_samples = X.shape[0]
n_classes = self.classes_.shape[0]
log_prior = np.log(self.class_prior_ + 1e-300)
jll = np.empty((n_samples, n_classes), dtype=float)
for c in range(n_classes):
mean = self.theta_[c]
var = self.var_[c]
# sum_j [ -0.5 log(2π var_j) - (x_j - mean_j)^2 / (2 var_j) ]
log_prob = -0.5 * np.sum(np.log(2.0 * np.pi * var))
log_prob = log_prob - 0.5 * np.sum(((X - mean) ** 2) / var, axis=1)
jll[:, c] = log_prior[c] + log_prob
return jll
def predict_proba(self, X: np.ndarray) -> np.ndarray:
jll = self._joint_log_likelihood(X)
log_norm = logsumexp(jll, axis=1, keepdims=True)
return np.exp(jll - log_norm)
def predict(self, X: np.ndarray) -> np.ndarray:
jll = self._joint_log_likelihood(X)
return self.classes_[np.argmax(jll, axis=1)]
# A clean Gaussian-ish dataset (two blobs)
X_g, y_g = make_blobs(
n_samples=800,
centers=[(-2, -1), (2, 1)],
cluster_std=[1.2, 1.1],
random_state=7,
)
X_tr_g, X_te_g, y_tr_g, y_te_g = train_test_split(X_g, y_g, test_size=0.3, random_state=7, stratify=y_g)
scratch_gnb = ScratchGaussianNB(var_smoothing=1e-9).fit(X_tr_g, y_tr_g)
sk_gnb = GaussianNB(var_smoothing=1e-9).fit(X_tr_g, y_tr_g)
pred_scratch = scratch_gnb.predict(X_te_g)
pred_sklearn = sk_gnb.predict(X_te_g)
print("Scratch GaussianNB accuracy:", accuracy_score(y_te_g, pred_scratch))
print("sklearn GaussianNB accuracy:", accuracy_score(y_te_g, pred_sklearn))
Scratch GaussianNB accuracy: 0.9916666666666667
sklearn GaussianNB accuracy: 0.9916666666666667
def plot_proba_boundary_2d(model, X, y, title: str, grid_steps: int = 220):
x_min, x_max = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
y_min, y_max = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0
xs = np.linspace(x_min, x_max, grid_steps)
ys = np.linspace(y_min, y_max, grid_steps)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]
proba = model.predict_proba(grid)[:, 1].reshape(xx.shape)
fig = go.Figure()
fig.add_trace(go.Contour(
x=xs,
y=ys,
z=proba,
colorscale="RdBu",
opacity=0.75,
contours=dict(showlines=False),
colorbar=dict(title="P(class=1)"),
))
fig.add_trace(go.Scatter(
x=X[:, 0],
y=X[:, 1],
mode="markers",
marker=dict(color=y, colorscale="Viridis", size=6, line=dict(width=0.5, color="white")),
name="data",
))
fig.update_layout(title=title, width=760, height=520)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
return fig
fig1 = plot_proba_boundary_2d(scratch_gnb, X_te_g, y_te_g, "Scratch GaussianNB decision surface")
fig1.show()
fig2 = plot_proba_boundary_2d(sk_gnb, X_te_g, y_te_g, "sklearn GaussianNB decision surface")
fig2.show()
3.1 sklearn GaussianNB parameters#
GaussianNB(priors=None, var_smoothing=1e-9)
priors: manually set class priors \(\pi_c\). Useful when your training data is not representative of deployment.
var_smoothing: adds a small value to variances to prevent numerical issues.
Interpretation of var_smoothing:
too small → can blow up when a feature has tiny variance
too large → oversmooths and washes out feature differences
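To get a feel for that trade-off, you can sweep var_smoothing on the blob data from above, mirroring the alpha sweep we do later for MultinomialNB; on such an easy dataset the curve will likely stay flat until the value becomes very large, so treat this as a sketch rather than a benchmark.
# Sketch: sweep var_smoothing on the blob data (X_tr_g, y_tr_g, X_te_g, y_te_g from above)
vs_values = np.logspace(-12, 2, 15)
vs_acc = [
    accuracy_score(y_te_g, GaussianNB(var_smoothing=float(v)).fit(X_tr_g, y_tr_g).predict(X_te_g))
    for v in vs_values
]
fig = go.Figure(go.Scatter(x=vs_values, y=vs_acc, mode="lines+markers"))
fig.update_layout(title="GaussianNB: test accuracy vs var_smoothing", xaxis_title="var_smoothing (log scale)", yaxis_title="accuracy", width=800, height=450)
fig.update_xaxes(type="log")
fig.show()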
4) Multinomial Naive Bayes (counts / text)#
Multinomial NB is the classic choice for bag-of-words style inputs.
Think:
features are counts (how many times word j appears)
a document is generated by repeatedly sampling words from a class-specific distribution
Model#
For class \(c\), a vocabulary distribution \(\theta_c\) over \(V\) words:
\[
\theta_{c,j} = P(\text{word } j \mid y=c), \qquad \sum_{j=1}^{V}\theta_{c,j} = 1
\]
Given a document count vector \(x \in \mathbb{N}^V\):
\[
P(x \mid y=c) \propto \prod_{j=1}^{V} \theta_{c,j}^{\,x_j}
\]
Taking logs:
\[
\log P(y=c \mid x) \propto \log \pi_c + \sum_{j=1}^{V} x_j \log \theta_{c,j}
\]
Smoothing (Dirichlet / Laplace)#
Without smoothing, unseen words can give \(\theta_{c,j}=0\) and kill probabilities.
With additive smoothing (\(\alpha>0\)):
\[
\hat{\theta}_{c,j} = \frac{N_{c,j} + \alpha}{N_c + \alpha V}
\]
where \(N_{c,j}\) is the count of word \(j\) in class-\(c\) documents and \(N_c = \sum_j N_{c,j}\).
Anecdote:
Smoothing is like saying: “Even if we haven’t seen the word ‘unicorn’ in spam yet, we won’t assume it’s impossible.”
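Here is a minimal sketch of that failure mode with toy counts (not the dataset below): with no smoothing, a single occurrence of a word that was never seen in a class sends that class’s log-likelihood to minus infinity.
# Sketch: why alpha = 0 is dangerous (toy counts, word 2 never seen in this class)
counts = np.array([5.0, 3.0, 0.0, 2.0])
doc = np.array([1, 0, 1, 0])                                  # the document uses word 2 once
theta_raw = counts / counts.sum()
theta_smooth = (counts + 1.0) / (counts.sum() + counts.size)  # Laplace smoothing, alpha = 1
with np.errstate(divide="ignore"):
    print("unsmoothed log-likelihood:", doc @ np.log(theta_raw))     # -inf
print("smoothed log-likelihood:  ", doc @ np.log(theta_smooth))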
def make_synthetic_text_dataset(
n_docs: int = 2000,
vocab_size: int = 30,
avg_len: int = 60,
imbalance: float = 0.5,
seed: int = 7,
):
r = np.random.default_rng(seed)
# Class priors
p1 = float(imbalance)
y = (r.random(n_docs) < p1).astype(int)
# Class-specific word distributions (Dirichlet draws)
# Make them different by shifting concentration around two different centers.
base0 = r.random(vocab_size)
base1 = r.random(vocab_size)
base0 = base0 / base0.sum()
base1 = base1 / base1.sum()
# Sharpen and separate distributions
theta0 = r.dirichlet(25 * base0 + 1.0)
theta1 = r.dirichlet(25 * base1 + 1.0)
# Document lengths
lengths = r.poisson(lam=avg_len, size=n_docs) + 5
X = np.zeros((n_docs, vocab_size), dtype=int)
for i in range(n_docs):
theta = theta1 if y[i] == 1 else theta0
X[i] = r.multinomial(n=lengths[i], pvals=theta)
vocab = [f"w{j:02d}" for j in range(vocab_size)]
return X, y, theta0, theta1, vocab
X_counts, y_text, theta0, theta1, vocab = make_synthetic_text_dataset(n_docs=3000, vocab_size=40, avg_len=70, imbalance=0.5)
X_tr_t, X_te_t, y_tr_t, y_te_t = train_test_split(X_counts, y_text, test_size=0.3, random_state=7, stratify=y_text)
# Plot the true underlying word probabilities (top words)
idx0 = np.argsort(theta0)[-10:][::-1]
idx1 = np.argsort(theta1)[-10:][::-1]
fig = go.Figure()
fig.add_trace(go.Bar(x=[vocab[i] for i in idx0], y=theta0[idx0], name="class 0"))
fig.add_trace(go.Bar(x=[vocab[i] for i in idx1], y=theta1[idx1], name="class 1"))
fig.update_layout(
title="Synthetic text: top word probabilities per class (ground truth)",
barmode="group",
xaxis_title="word",
yaxis_title="probability",
width=900,
height=450,
)
fig.show()
@dataclass
class ScratchMultinomialNB:
alpha: float = 1.0
fit_prior: bool = True
def fit(self, X: np.ndarray, y: np.ndarray):
X = np.asarray(X)
if np.any(X < 0):
raise ValueError("MultinomialNB expects non-negative counts")
y = np.asarray(y)
self.classes_, y_enc = np.unique(y, return_inverse=True)
n_classes = self.classes_.shape[0]
n_features = X.shape[1]
class_count = np.bincount(y_enc, minlength=n_classes).astype(float)
if self.fit_prior:
self.class_log_prior_ = np.log(class_count / class_count.sum())
else:
self.class_log_prior_ = np.full(n_classes, -np.log(n_classes), dtype=float)
# feature counts per class
feature_count = np.zeros((n_classes, n_features), dtype=float)
for c in range(n_classes):
feature_count[c] = X[y_enc == c].sum(axis=0)
smoothed_fc = feature_count + self.alpha
smoothed_cc = smoothed_fc.sum(axis=1, keepdims=True)
self.feature_log_prob_ = np.log(smoothed_fc) - np.log(smoothed_cc)
return self
def _joint_log_likelihood(self, X: np.ndarray) -> np.ndarray:
X = np.asarray(X)
return X @ self.feature_log_prob_.T + self.class_log_prior_[None, :]
def predict_proba(self, X: np.ndarray) -> np.ndarray:
jll = self._joint_log_likelihood(X)
return np.exp(jll - logsumexp(jll, axis=1, keepdims=True))
def predict(self, X: np.ndarray) -> np.ndarray:
jll = self._joint_log_likelihood(X)
return self.classes_[np.argmax(jll, axis=1)]
# Compare scratch vs sklearn MultinomialNB
scratch_mnb = ScratchMultinomialNB(alpha=1.0, fit_prior=True).fit(X_tr_t, y_tr_t)
sk_mnb = MultinomialNB(alpha=1.0, fit_prior=True).fit(X_tr_t, y_tr_t)
pred_scratch = scratch_mnb.predict(X_te_t)
pred_sklearn = sk_mnb.predict(X_te_t)
print("Scratch MultinomialNB accuracy:", accuracy_score(y_te_t, pred_scratch))
print("sklearn MultinomialNB accuracy:", accuracy_score(y_te_t, pred_sklearn))
print()
print("Classification report (sklearn MultinomialNB):")
print(classification_report(y_te_t, pred_sklearn, digits=3))
Scratch MultinomialNB accuracy: 1.0
sklearn MultinomialNB accuracy: 1.0
Classification report (sklearn MultinomialNB):
precision recall f1-score support
0 1.000 1.000 1.000 451
1 1.000 1.000 1.000 449
accuracy 1.000 900
macro avg 1.000 1.000 1.000 900
weighted avg 1.000 1.000 1.000 900
# Effect of smoothing alpha
alphas = np.logspace(-3, 1, 20)
acc = []
for a in alphas:
m = MultinomialNB(alpha=float(a)).fit(X_tr_t, y_tr_t)
acc.append(accuracy_score(y_te_t, m.predict(X_te_t)))
fig = go.Figure()
fig.add_trace(go.Scatter(x=alphas, y=acc, mode="lines+markers"))
fig.update_layout(
title="MultinomialNB: test accuracy vs smoothing alpha",
xaxis_title="alpha (log scale)",
yaxis_title="accuracy",
width=800,
height=450,
)
fig.update_xaxes(type="log")
fig.show()
4.1 sklearn MultinomialNB parameters#
MultinomialNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None)
alpha: additive smoothing strength
force_alpha: if False, tiny alpha values may be clamped for numeric stability
fit_prior: learn class priors from data
class_prior: set priors manually (overrides fit_prior)
Rules of thumb:
try alpha in [0.01, 1.0] for text
use class_prior when you know deployment base rates differ from training
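If you would rather pick alpha from the data than from a rule of thumb, a small cross-validated grid search over the training counts is the standard approach; this sketch reuses X_tr_t / y_tr_t from above with an arbitrarily chosen alpha grid.
# Sketch: choose alpha by cross-validation on the training counts
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.01, 0.03, 0.1, 0.3, 1.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr_t, y_tr_t)
print("best alpha:", search.best_params_["alpha"])
print("mean CV accuracy:", search.best_score_)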
5) Bernoulli Naive Bayes (binary features)#
Bernoulli NB is like Multinomial NB, but it cares about presence/absence rather than counts.
For binary \(x_j \in \{0,1\}\):
\[
P(x \mid y=c) = \prod_{j=1}^{d} \theta_{c,j}^{\,x_j}\,(1-\theta_{c,j})^{1-x_j}
\]
where \(\theta_{c,j} = P(x_j = 1 \mid y=c)\).
When does Bernoulli NB shine?
when word frequency is less important than word presence
when you want to explicitly model “word not present” as evidence
In sklearn, BernoulliNB also supports a binarize threshold that turns counts into 0/1.
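A quick sketch of that option, checking on the count matrices from above that binarize=0.0 applied to raw counts gives the same predictions as binarizing by hand:
# Sketch: binarize=0.0 on raw counts should match manual binarization
m_auto = BernoulliNB(alpha=1.0, binarize=0.0).fit(X_tr_t, y_tr_t)
m_manual = BernoulliNB(alpha=1.0, binarize=None).fit((X_tr_t > 0).astype(int), y_tr_t)
same = np.array_equal(m_auto.predict(X_te_t), m_manual.predict((X_te_t > 0).astype(int)))
print("identical predictions:", same)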
# Compare BernoulliNB vs MultinomialNB on the same synthetic text
X_tr_bin = (X_tr_t > 0).astype(int)
X_te_bin = (X_te_t > 0).astype(int)
m_mnb = MultinomialNB(alpha=1.0).fit(X_tr_t, y_tr_t)
m_bnb = BernoulliNB(alpha=1.0, binarize=None).fit(X_tr_bin, y_tr_t)
acc_mnb = accuracy_score(y_te_t, m_mnb.predict(X_te_t))
acc_bnb = accuracy_score(y_te_t, m_bnb.predict(X_te_bin))
fig = go.Figure()
fig.add_trace(go.Bar(x=["MultinomialNB (counts)", "BernoulliNB (binary)"] , y=[acc_mnb, acc_bnb]))
fig.update_layout(title="Counts vs binary: accuracy comparison", yaxis_title="accuracy", width=700, height=420)
fig.show()
print("MultinomialNB accuracy:", acc_mnb)
print("BernoulliNB accuracy :", acc_bnb)
MultinomialNB accuracy: 1.0
BernoulliNB accuracy : 1.0
6) Complement Naive Bayes (imbalanced text)#
Complement NB was designed for text classification when classes are imbalanced.
Idea (intuition):
Instead of modeling “what class c looks like”, model “what not c looks like” (the complement).
Then classify by picking the class whose complement is least compatible with the document.
In practice, ComplementNB often improves performance on imbalanced text datasets.
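To make the intuition concrete, here is a minimal sketch of the core computation on the balanced count data from section 4 (X_tr_t, y_tr_t): for each class we build a smoothed word distribution from its complement (all other classes) and score documents with the negative log of that distribution. This follows the basic idea only; scikit-learn's ComplementNB adds details such as the optional norm step.
# Sketch: core Complement NB computation (complement counts -> negative log weights)
def complement_nb_weights(X, y, alpha=1.0):
    classes = np.unique(y)
    weights = np.zeros((classes.size, X.shape[1]))
    for i, c in enumerate(classes):
        comp_counts = X[y != c].sum(axis=0) + alpha   # word counts outside class c (smoothed)
        theta_comp = comp_counts / comp_counts.sum()
        weights[i] = -np.log(theta_comp)              # rare outside c => evidence for c
    return classes, weights

cls, W = complement_nb_weights(X_tr_t, y_tr_t, alpha=1.0)
pred = cls[np.argmax(X_te_t @ W.T, axis=1)]
print("Sketch Complement NB accuracy:", accuracy_score(y_te_t, pred))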
# Make an imbalanced dataset (class 1 is rare)
X_counts_imb, y_imb, _, _, _ = make_synthetic_text_dataset(n_docs=6000, vocab_size=40, avg_len=60, imbalance=0.1, seed=11)
X_tr_i, X_te_i, y_tr_i, y_te_i = train_test_split(X_counts_imb, y_imb, test_size=0.3, random_state=7, stratify=y_imb)
m_mnb_i = MultinomialNB(alpha=1.0).fit(X_tr_i, y_tr_i)
m_cnb_i = ComplementNB(alpha=1.0).fit(X_tr_i, y_tr_i)
pred_mnb = m_mnb_i.predict(X_te_i)
pred_cnb = m_cnb_i.predict(X_te_i)
print("Class balance (test):", np.bincount(y_te_i) / y_te_i.size)
print()
print("MultinomialNB report:")
print(classification_report(y_te_i, pred_mnb, digits=3))
print("ComplementNB report:")
print(classification_report(y_te_i, pred_cnb, digits=3))
Class balance (test): [0.9011 0.0989]
MultinomialNB report:
precision recall f1-score support
0 1.000 1.000 1.000 1622
1 1.000 1.000 1.000 178
accuracy 1.000 1800
macro avg 1.000 1.000 1.000 1800
weighted avg 1.000 1.000 1.000 1800
ComplementNB report:
precision recall f1-score support
0 1.000 1.000 1.000 1622
1 1.000 1.000 1.000 178
accuracy 1.000 1800
macro avg 1.000 1.000 1.000 1800
weighted avg 1.000 1.000 1.000 1800
6.1 sklearn ComplementNB parameters#
ComplementNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None, norm=False)
norm: if True, a second normalization of the weights is performed; sometimes helps.
ComplementNB is designed for text classification and is particularly suited to imbalanced class distributions.
7) Categorical Naive Bayes (discrete categories)#
CategoricalNB is for features like:
color ∈ {red, green, blue}
browser ∈ {chrome, safari, firefox}
country ∈ {DE, FR, US, …}
Each feature is an integer code representing a category.
For each class and feature, we estimate a categorical probability table.
CategoricalNB is not the same as one-hot encoding + MultinomialNB.
It’s a direct model of per-feature categorical distributions.
# Toy categorical dataset
# Features: [weather, transport]
# weather: 0=sunny,1=rainy,2=overcast
# transport: 0=car,1=bus,2=bike
# Label: 1=go_out, 0=stay_in
weather = rng.integers(0, 3, size=800)
transport = rng.integers(0, 3, size=800)
# Make a slightly structured rule with noise
p_go_out = (
0.15
+ 0.25 * (weather == 0) # sunny
+ 0.10 * (weather == 2) # overcast
+ 0.15 * (transport == 2) # bike
- 0.15 * (weather == 1) # rainy
)
p_go_out = np.clip(p_go_out, 0.05, 0.95)
y_cat = (rng.random(800) < p_go_out).astype(int)
X_cat = np.c_[weather, transport]
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(X_cat, y_cat, test_size=0.3, random_state=7, stratify=y_cat)
m_cat = CategoricalNB(alpha=1.0).fit(X_tr_c, y_tr_c)
acc_cat = accuracy_score(y_te_c, m_cat.predict(X_te_c))
print("CategoricalNB accuracy:", acc_cat)
# Visualize predicted P(go_out=1) for each combination
combos = np.array([(w, t) for w in range(3) for t in range(3)])
proba = m_cat.predict_proba(combos)[:, 1]
labels_weather = ["sunny", "rainy", "overcast"]
labels_transport = ["car", "bus", "bike"]
z = proba.reshape(3, 3)
fig = go.Figure(data=go.Heatmap(
z=z,
x=labels_transport,
y=labels_weather,
colorscale="Blues",
colorbar=dict(title="P(go_out=1)"),
))
fig.update_layout(title="CategoricalNB: predicted probability table", width=650, height=450)
fig.show()
CategoricalNB accuracy: 0.7166666666666667
7.1 sklearn CategoricalNB parameters#
CategoricalNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None, min_categories=None)
min_categories: force each feature to have at least this many categories (useful if some categories are missing in training).
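A small hypothetical sketch of why that matters: if a category code never appears during fit, the learned probability tables have no row for it, and prediction on such a code is expected to fail unless min_categories reserved space for it (the toy codes below are made up).
# Sketch: min_categories reserves table rows for category codes unseen at fit time
X_small = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])   # feature 0 never shows code 2 here
y_small = np.array([0, 1, 0, 1])
m_default = CategoricalNB(alpha=1.0).fit(X_small, y_small)
m_min = CategoricalNB(alpha=1.0, min_categories=3).fit(X_small, y_small)
X_new = np.array([[2, 1]])                              # code 2 appears only at predict time
try:
    m_default.predict(X_new)
except Exception as e:
    print("default model fails:", type(e).__name__)
print("with min_categories=3:", m_min.predict(X_new))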
8) Out-of-core naive Bayes model fitting (partial_fit)#
Sometimes your dataset is too large to fit in memory.
Naive Bayes is great here because you can train it incrementally:
stream data in batches
call partial_fit repeatedly
the model updates its sufficient statistics (counts / means / variances)
Important details:
On the first call to partial_fit, you must pass classes=np.array([...]).
partial_fit is available for several NB variants (including MultinomialNB and GaussianNB).
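The same pattern works for continuous features. A minimal sketch with GaussianNB on slices of the blob data from section 3 (slice size chosen arbitrarily):
# Sketch: incremental GaussianNB on the blob data from section 3
gnb_stream = GaussianNB()
step = 100
for start in range(0, X_tr_g.shape[0], step):
    Xb, yb = X_tr_g[start:start + step], y_tr_g[start:start + step]
    if start == 0:
        gnb_stream.partial_fit(Xb, yb, classes=np.array([0, 1]))  # classes required on first call
    else:
        gnb_stream.partial_fit(Xb, yb)
print("streamed GaussianNB accuracy:", accuracy_score(y_te_g, gnb_stream.predict(X_te_g)))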
def stream_synthetic_text_batches(
n_docs: int,
vocab_size: int,
avg_len: int,
class_prior: float,
seed: int,
batch_size: int,
):
r = np.random.default_rng(seed)
# fixed class word distributions
base0 = r.random(vocab_size)
base1 = r.random(vocab_size)
base0 = base0 / base0.sum()
base1 = base1 / base1.sum()
theta0 = r.dirichlet(30 * base0 + 1.0)
theta1 = r.dirichlet(30 * base1 + 1.0)
n_batches = (n_docs + batch_size - 1) // batch_size
for b in range(n_batches):
m = min(batch_size, n_docs - b * batch_size)
y = (r.random(m) < class_prior).astype(int)
lengths = r.poisson(lam=avg_len, size=m) + 5
X = np.zeros((m, vocab_size), dtype=int)
for i in range(m):
theta = theta1 if y[i] == 1 else theta0
X[i] = r.multinomial(n=int(lengths[i]), pvals=theta)
yield X, y
# Fixed test set
X_test_stream, y_test_stream, _, _, _ = make_synthetic_text_dataset(
n_docs=2500, vocab_size=80, avg_len=70, imbalance=0.2, seed=123
)
# Stream training batches
batch_size = 400
stream = stream_synthetic_text_batches(
n_docs=12000,
vocab_size=80,
avg_len=70,
class_prior=0.2,
seed=999,
batch_size=batch_size,
)
m_stream = MultinomialNB(alpha=0.5)
classes = np.array([0, 1])
seen = 0
checkpoints = []
accs = []
for X_batch, y_batch in stream:
if seen == 0:
m_stream.partial_fit(X_batch, y_batch, classes=classes)
else:
m_stream.partial_fit(X_batch, y_batch)
seen += X_batch.shape[0]
if seen % (batch_size * 2) == 0:
y_pred = m_stream.predict(X_test_stream)
checkpoints.append(seen)
accs.append(accuracy_score(y_test_stream, y_pred))
fig = go.Figure()
fig.add_trace(go.Scatter(x=checkpoints, y=accs, mode="lines+markers"))
fig.update_layout(
title="Out-of-core MultinomialNB: accuracy vs streamed samples",
xaxis_title="# samples seen",
yaxis_title="accuracy on fixed test set",
width=850,
height=450,
)
fig.show()
Summary: choosing the right Naive Bayes#
GaussianNB: continuous features; surprisingly strong baseline for numeric data.
MultinomialNB: count data (text word counts, event counts).
BernoulliNB: binary features (word present/absent).
ComplementNB: often better than MultinomialNB on imbalanced text.
CategoricalNB: discrete categorical features (integer-coded categories).
Exercises#
Create a dataset where features are highly correlated and see how GaussianNB degrades.
For MultinomialNB, plot how alpha changes the top words for each class.
For BernoulliNB, compare binarize=0, binarize=1, and manual binarization.
Stream batches with partial_fit and compare to a single fit.